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Learning  Sparse  Feature  Representations  using 
Probabilistic  Quadtrees  and  Deep  Belief  Nets 
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3-  NASA  Advanced  Supercomputing  Division, 
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Abstract.  Learning  sparse  feature  representations  is  a  useful  instru¬ 
ment  for  solving  an  unsupervised  learning  problem.  In  this  paper,  we 
present  three  labeled  handwritten  digit  datasets,  collectively  called  n- 
MNIST.  Then,  we  propose  a  novel  framework  for  the  classification  of 
handwritten  digits  that  learns  sparse  representations  using  probabilistic 
quadtrees  and  Deep  Belief  Nets.  On  the  MNIST  and  n-MNIST  datasets, 
our  framework  shows  promising  results  and  significantly  outperforms  tra¬ 
ditional  Deep  Belief  Networks. 


1  Introduction 

Deep  Learning  has  gained  popularity  over  the  last  decade  due  to  its  ability  to 
learn  data  representations  in  an  unsupervised  manner  and  generalize  to  unseen 
data  samples  using  hierarchical  representations.  The  most  recent  and  best- 
known  Deep  learning  model  is  the  Deep  Belief  Network^.  In  [5],  Deep  Belief 
Networks  have  been  used  for  modeling  acoustic  signals  and  have  been  shown 
to  outperform  traditional  approaches  using  Gaussian  Mixture  Models  for  Auto¬ 
matic  Speech  Recognition  (ASR).  Deep  Belief  Network  is  trained  one  layer  at  a 
time  using  Restricted  Boltzmann  Machines  (RBM) .  A  sparse  feature  learning  al¬ 
gorithm  for  Deep  Belief  Networks  was  proposed  in  |3].  However,  their  work  was 
focused  on  maximization  of  information  content  in  the  learned  representations. 
Restricted  Boltzmann  Machines,  on  the  other  hand,  are  trained  by  minimizing 
a  contrastive  term  in  the  loss  function. 

The  main  contributions  of  our  work  are  twofold  —  (I)  We  first  present  three  la¬ 
beled  handwritten  digit  datasets,  collectively  called  n-MNIST,  created  by  adding 
white  gaussian  noise,  motion  blur  and  reduced  contrast  to  the  original  MNIST 
dataset [4].  (2)  Then,  we  present  a  framework  for  the  classification  of  handwrit¬ 
ten  digits  that  a)  learns  probabilistic  quadtrees  from  the  dataset,  b)  performs 
a  Depth  First  Search  on  the  quadtrees  to  create  sparse  representations  in  the 
form  of  linear  vectors,  and  c)  feeds  the  linear  vectors  into  a  Deep  Belief  Net¬ 
work  for  classification.  On  the  MNIST  and  n-MNIST  datasets,  our  framework 
shows  promising  results  and  significantly  outperforms  traditional  Deep  Belief 
Networks. 
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2  Dataset43 


We  evaluate  our  framework  on  the  MNIST  dataset  [3]  of  handwritten  digits  as 
well  as  three  artificial  datasets  collectively  called  n-MNIST  (noisy  MNIST)  cre¬ 
ated  by  adding  -  (1)  additive  white  gaussian  noise,  (2)  motion  blur  and  (3)  a 
combination  of  additive  white  gaussian  noise  and  reduced  contrast  to  the  MNIST 
dataset.  Some  of  the  images  from  these  datasets  are  shown  in  Figure 


(a)  MNIST  with  Additive  (b)  MNIST  with  Motion  (c)  MNIST  with  AWGN 
White  Gaussian  Noise  Blur  and  reduced  contrast 


Fig.  1:  Example  images  from  the  n-MNIST  dataset  created  as  part  of  the  ex¬ 
periments 


3  Probabilistic  Quadtrees  for  Learning  Sparse  Represen¬ 
tations 

We  propose  a  novel  technique  based  on  probabilistic  quadtrees  that  can  reduce 
the  dimensionality  of  a  dataset  in  a  probabilistically  sound  way.  We  learn  the 
structure  of  the  quadtree  from  the  samples  of  a  dataset.  A  quadtree  splits  each 
image  into  four  equi-sized  windows,  and  then  performs  a  test  of  homogeneity  on 
each  image  window.  If  a  block  meets  the  homogeneity  criterion,  it  is  not  divided 
any  further  into  sub-windows.  If  otherwise,  it  does  not  meet  the  criterion,  it  is 
again  subdivided  into  four  sub-windows,  and  the  test  criterion  is  in  turn  applied 
to  those  smaller  windows.  This  process  is  repeated  on  all  the  sub-windows  until 
each  meets  the  homogeneity  criterion.  The  resulting  data  structure  can  have 
windows  of  several  different  sizes.  The  homogeneity  criterion  can  be  defined 
as  follows  -  Split  a  block  if  the  maximum  value  of  the  block  elements  minus 
the  minimum  value  is  greater  than  a  threshold  t.  Threshold  t  is  specified  as 
a  value  between  0  and  1  (chosen  here  as  0.27  by  experiments).  Denoting  the 
homogeneity  criterion  for  sample  d  as  Hd,  this  can  be  formally  presented  as 
follows: 

^The  datasets  are  available  at  the  web  link  along  with  a  detailed  description  of  the 
methods  and  parameters  used  to  create  them 
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Hd 


true, 

false, 


ifinax(j)  —  min(i)  <  r  |  r  S  [0, 1] 

iGd  i^d 

ifmax(i)  —  min(i)  >  r  |  r  G  [0, 1] 

iGd  iGd 


(1) 


Alternatively,  the  homogeneity  criterion  can  be  considered  proportional  to 
the  standard  deviation  of  the  probability  distribution  of  the  dataset.  So,  higher 
the  standard  deviation,  higher  the  average  texture  of  a  block  and  higher  is  the 
probability  of  the  block  being  divided  into  sub-blocks. 

In  the  learned  quadtree  structure  for  a  given  dataset,  a  node  is  divided  into 
smaller  windows  if  the  homogeneity  criterion  is  not  met  for  any  sample  in  the 
dataset.  The  node  is  not  divided  into  smaller  windows  only  if  the  homogeneity 
criterion  is  met  by  all  samples  in  the  dataset. 

We  can  consider  each  node  of  the  quadtree  as  a  binary  random  variable  X, 
which  can  take  one  of  two  values  1  or  0  based  on  whether  it  is  divided  into  smaller 
windows  or  not.  So,  for  a  total  of  N  samples  in  dataset  D,  the  random  variable 
X  may  take  on  one  of  iV  -|-  1  possible  split  state  values:  one  value  for  each  of  the 
samples  not  meeting  the  homogeneity  criterion,  and  one  value  indicating  that 
all  samples  meet  the  homogeneity  criterion.  This  can  be  formally  presented  as 
follows: 


^  _  I  1,  if3de  D  \  D  =  {do,di,d2,...,dN}r]{Hd  =  false} 

|0,  ii^d  €  D  \  D  =  {do,di,d2, ..., djv}  H  {Hd  =  true} 

Once  learned,  the  probabilistic  quadtree  helps  in  reducing  the  dimensionality 
of  the  data,  which  captures  the  statistics  of  the  training  samples  in  the  dataset. 
A  depth  first  search  on  the  learned  tree  yields  a  linear  vector  that  is  then  fed 
into  an  unsupervised  learning  framework. 

4  Deep  Belief  Network  for  Feature  Learning 

Deep  Belief  Network  (DBN)  consists  of  multiple  layers  of  stochastic,  latent  vari¬ 
ables  trained  using  an  unsupervised  learning  algorithm  followed  by  a  supervised 
learning  phase  using  Feedforward  Backpropagation  Neural  Networks.  In  the  un¬ 
supervised  pre-training  stage,  each  layer  is  trained  using  a  Restricted  Boltzmann 
Machine  (RBM).  Once  trained,  the  weights  of  the  DBN  are  used  to  initialize  the 
corresponding  weights  of  a  Neural  Network  [^.  A  Neural  Network  initialized  in 
this  manner  converges  much  faster  than  an  otherwise  uninitialized  one.  A  DBN 
is  a  graphical  model  [7]  where  neurons  of  the  hidden  layer  are  conditionally 
independent  of  each  other  given  a  particular  configuration  of  the  visible  layer 
and  vice  versa.  A  DBN  can  be  trained  layer-wise  by  iteratively  maximizing  the 
conditional  probability  of  the  input  vectors  or  visible  vectors  given  the  hidden 
vectors  and  a  particular  set  of  layer  weights.  As  shown  in  [1],  this  layer- wise 
training  can  help  in  improving  the  variational  lower  bound  on  the  probability 
of  the  input  training  data,  which  in  turn  leads  to  an  improvement  of  the  over¬ 
all  generative  model.  We  first  provide  a  formal  introduction  to  the  Restricted 
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Boltzmann  Machine.  The  RBM  can  be  denoted  by  the  energy  function: 

E{v,  h)  =  -^  ^  bjhj  -  XI  XI  (3) 

i  j  i  j 

where,  the  RBM  consists  of  a  matrix  of  layer  weights  W  =  {wij)  between  the 
hidden  units  hj  and  the  visible  units  Vi.  The  and  bj  are  the  bias  weights  for 
the  visible  units  and  the  hidden  units  respectively.  The  RBM  takes  the  structure 
of  a  bipartite  graph  and  hence  it  only  has  inter-layer  connections  between  the 
hidden  or  visible  layer  neurons  but  no  intra-layer  connections  within  the  hidden 
or  visible  layers.  So,  the  visible  unit  activations  are  mutually  independent  given 
a  particular  set  of  hidden  unit  activations  and  vice  versa  [S] .  Hence,  by  setting 
either  h  or  v  constant,  we  can  compute  the  conditional  distribution  of  the  other 
as  follows: 


m 


P{hj  =  I|z;)  =  a{bj  + 

(4) 

n 

P{vi  =  I  h)  =  a{ai  + 

(5) 

where,  a  denotes  the  log  sigmoid  function: 

1  ,  -X 

1  +  e  ^ 

(6) 

The  training  algorithm  maximizes  the  expected  log  probability  assigned  to 
the  training  dataset  D.  So  if  the  training  dataset  D  consists  of  the  visible  vectors 
V,  then  the  objective  function  is  as  follows: 

argmaxA  >  \ogP{v) 

kL  L  -I 

vGV 

(7) 

A  Restricted  Boltzmann  Machine  is  trained  using  a  Contrastive  Divergence 
algorithm  [S].  Once  trained  the  DBN  is  used  to  initialize  the  weights  of  a  feed¬ 
forward  backpropagation  neural  network  that  is  then  used  for  classihcation. 


5  Results  and  Comparative  Studies 

Various  network  architectures  along  with  the  test  set  error  for  the  traditional 
DBN  framework  and  the  probabilistic  quadtree  based  framework  on  the  MNIST 
and  the  three  n-MNIST  datasets  are  listed  in  Tables  and  From  the  Tables, 
it  is  evident  that  our  best  performing  network  outperforms  the  best  traditional 
Deep  Belief  Network  on  both  the  MNIST  and  n-MNIST  datasets.  On  the  MNIST 
dataset,  our  best  network  exhibits  a  relative  improvement  of  ~25%  over  the 
traditional  DBN.  For  the  n-MNIST  dataset,  it  provides  a  relative  improvement 
of  ^36%  for  Additive  White  Gaussian  Noise  (AWGN),  ~26%  for  Motion  Blur 
and  ~I2%  for  AWGN  and  Reduced  contrast. 
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MNIST 

n-MNIST  with  AWGN 

Architecture 

(Neurons) 

Test  Error 
DBN(%) 

Test  Error 
Ours(%) 

Test  Error 
DBN(%) 

Test  Error 
Ours(%) 

50-50 

4.64 

2.93 

89.95 

13.41 

100-100 

3.01 

2.45 

91.43 

12.01 

150-150 

2.34 

2.21 

89.95 

13.49 

200-200 

2.08 

1.96 

88.49 

10.56 

250-250 

1.93 

1.83 

88.49 

13.00 

300-300 

2.02 

1.80 

68.18 

11.24 

350-350 

1.96 

1.74 

90.31 

13.15 

400-400 

1.95 

1.67 

49.27 

10.96 

450-450 

1.93 

1.38 

32.26 

12.62 

500-500 

1.86 

1.43 

69.68 

9.93 

Table  1:  Test  Error  of  a  traditional  DBN  and  our  framework  with  various  archi¬ 
tectures  on  MNIST  and  n-MNIST  with  AWGN 


n-MNIST  with 
Motion  Blur 

n-MNIST  with  AWGN 
and  Reduced  Contrast 

Architecture 

(Neurons) 

Test  Error 
DBN(%) 

Test  Error 
Ours(%) 

Test  Error 
DBN(%) 

Test  Error 
Ours(%) 

50-50 

5.64 

4.17 

10.21 

9.29 

100-100 

4.68 

3.31 

9.43 

9.21 

150-150 

3.99 

3.29 

16.40 

9.00 

200-200 

3.74 

3.03 

15.57 

8.79 

250-250 

3.74 

2.60 

52.31 

8.94 

300-300 

3.50 

3.04 

32.29 

8.28 

350-350 

3.82 

2.91 

86.31 

8.90 

400-400 

3.74 

3.01 

68.78 

8.31 

450-450 

3.91 

2.75 

51.32 

8.36 

500-500 

3.66 

2.83 

68.19 

7.84 

Table  2:  Test  Error  of  a  traditional  DBN  and  our  framework  with  various  archi¬ 
tectures  on  n-MNIST  with  Motion  Blur;  and  with  AWGN  and  Reduced  Contrast 


6  Discussion  and  Fnture  Directions 

Our  learning  framework  based  on  probabilistic  quadtrees  significantly  outper¬ 
forms  traditional  Deep  Belief  Networks  on  both  the  MNIST  and  n-MNIST 
datasets.  Probabilistic  quadtrees  help  in  generating  sparse  representations  for 
the  dataset  and  significantly  improve  the  discriminative  power  of  the  framework. 

We  plan  to  investigate  the  use  of  various  pooling  techniques  like  SPM  [9]  as 
well  as  certain  sparse  representations  like  sparse  coding  [TU]  to  handle  n-MNIST. 
Hierarchical  representations  like  Convolutional  DBN  m  are  other  useful  can- 
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didates  for  investigation.  We  believe  that  n-MNIST  will  help  researchers  better 
apply  and  extend  the  research  on  understanding  representations  for  noisy  object 
recognition  datasets. 
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